# Chinese Image-Text Retrieval
## Chinese Clip Vit Large Patch14 336px
Chinese CLIP is an implementation of CLIP trained on approximately 200 million Chinese image-text pairs, using ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder.
Text-to-Image · Transformers
OFA-Sys · 713 · 23
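These CLIP-style models score text-to-image retrieval by cosine similarity between L2-normalized embeddings from the two encoders. A minimal NumPy sketch of that scoring step, with random placeholder embeddings standing in for real encoder outputs (the dimension and variable names here are illustrative, not the model's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: in practice these come from the text and image
# encoders (e.g. RoBERTa-wwm-base and ViT-L/14); 768 is an illustrative dim.
text_emb = rng.standard_normal((1, 768))    # one query caption
image_embs = rng.standard_normal((5, 768))  # five candidate images

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarity reduces to a dot product after L2 normalization.
sims = l2_normalize(text_emb) @ l2_normalize(image_embs).T  # shape (1, 5)

best = int(np.argmax(sims))  # index of the retrieved image
print(sims.shape, best)
```

With real model outputs, ranking candidates by this similarity is all that image-text retrieval requires at inference time.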
## Chinese Clip Vit Base Patch16
The base version of Chinese CLIP, using ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder, trained on a large-scale dataset of approximately 200 million Chinese image-text pairs.
Text-to-Image · Transformers
OFA-Sys · 49.02k · 104
## Taiyi CLIP RoBERTa 326M ViT H Chinese
The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs, with a RoBERTa-large architecture as the text encoder.
Text-to-Image · Transformers · Chinese · Apache-2.0
IDEA-CCNL · 108 · 10
## Taiyi CLIP Roberta Large 326M Chinese
The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs, supporting Chinese image-text feature extraction and zero-shot classification.
Text-to-Image · Transformers · Chinese · Apache-2.0
IDEA-CCNL · 10.37k · 39
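The zero-shot classification these cards mention reduces to encoding each class name in a prompt template and taking a softmax over the image's similarity to every label embedding. A sketch of that scoring step, assuming the embeddings have already been produced by the encoders (the values below are random placeholders, and the Chinese prompt template and logit scale of 100 follow common CLIP practice rather than any one model's documented setup):

```python
import numpy as np

labels = ["猫", "狗", "飞机"]  # cat, dog, airplane
prompts = [f"一张{label}的照片" for label in labels]  # "a photo of a {label}"

rng = np.random.default_rng(1)
# Placeholders for encoder outputs: one image embedding and one text
# embedding per prompt; real values would come from the model.
image_emb = rng.standard_normal(512)
text_embs = rng.standard_normal((len(prompts), 512))

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Similarity logits, scaled by CLIP's usual learned temperature (~100).
logits = 100.0 * (l2_normalize(text_embs) @ l2_normalize(image_emb))

# Numerically stable softmax over the candidate labels.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(dict(zip(labels, probs.round(3))))
```

The predicted class is simply the label with the highest probability; no task-specific fine-tuning is involved.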
## Taiyi CLIP Roberta 102M Chinese
The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs, with a text encoder based on the RoBERTa-base architecture.
Text-to-Image · Transformers · Chinese · Apache-2.0
IDEA-CCNL · 558 · 51
## Mengzi Oscar Base Retrieval
A Chinese image-text retrieval model fine-tuned on the COCO-ir dataset, based on the Chinese multimodal pretraining model Mengzi-Oscar.
Text-to-Image · Transformers · Chinese · Apache-2.0
Langboat · 17 · 3